Abstract

A striking feature of human syntactic processing is that it is
context-dependent, that is, it seems to take into account semantic
information from the discourse context and world knowledge. In this
paper, we attempt to use this insight to bridge the gap between SRL
results from gold parses and from automatically-generated parses.
To do this, we jointly perform parsing and semantic role labeling,
using a probabilistic SRL system to rerank the results of a
probabilistic parser. Our current results are negative, because a
locally-trained SRL model can return inaccurate probability estimates.

1  Introduction 

Although much effort has gone into developing statistical parsing
models and they have improved steadily over the years, in many
applications that use parse trees, errors made by the parser are a
major source of errors in the final output. A promising approach to
this problem is to perform both parsing and the higher-level task in
a single, joint probabilistic model. This not only allows uncertainty
about the parser output to be carried upward, such as through a
$k$-best list, but also allows information from higher-level
processing to improve parsing. For example, Miller et al. (2000)
showed that performing parsing and information extraction in a joint
model improves performance on both tasks. In particular, one suspects
that attachment decisions, which are both notoriously hard and
extremely important for semantic analysis, could benefit greatly from
input from higher-level semantic analysis.

The recent interest in semantic role labeling provides an opportunity
to explore how higher-level semantic information can inform syntactic
parsing. Previous work has shown that SRL systems that use full parse
information perform better than those that use shallow parse
information, but that machine-generated parses still perform much
worse than human-corrected gold parses.

The goal of this investigation is to narrow the gap between SRL
results from gold parses and from automatic parses. We aim to do this
by jointly performing parsing and semantic role labeling in a single
probabilistic model. In both parsing and SRL, state-of-the-art systems
are probabilistic; therefore, their predictions can be combined in a
principled way by multiplying probabilities. In this paper, we rerank
the $k$-best parse trees from a probabilistic parser using an SRL
system. We compare two reranking approaches, one that linearly weights
the log probabilities, and the other that learns a reranker over parse
trees and SRL frames in the manner of Collins (2000).

Currently, neither method performs better than simply selecting the
top predicted parse tree. We discuss some of the reasons for this; one
reason is that the ranking over parse trees induced by the semantic
role labeling score is unreliable, because the model is trained
locally.

2  Base  SRL  System 

Our approach to joint parsing and SRL begins with a base SRL system,
which uses a standard architecture from the literature. Our base SRL
system is a cascade of maximum-entropy classifiers which select the
semantic argument label for each constituent of a full parse tree. As
in other systems, we use three stages: pruning, identification, and
classification. First, in pruning, we use a deterministic
preprocessing procedure introduced by Xue and Palmer (2004) to prune
many constituents which are almost certainly not arguments. Second, in
identification, a binary MaxEnt classifier is used to prune remaining
constituents which are predicted to be null with high probability.
Finally, in classification, a multi-class MaxEnt classifier is used to
predict the argument type of the remaining constituents. This
classifier also has the option to output Null.

Base features [GJ02]
    Path to predicate
    Constituent type
    Head word
    Position
    Predicate
Head POS [SHWA03]
All conjunctions of above

Table 1: Features used in base identification classifier.

It can happen that the returned semantic arguments overlap, because
the local classifiers take no global constraints into account. This is
undesirable, because no overlaps occur in the gold semantic
annotations. We resolve overlaps using a simple recursive algorithm.
For each parent node that overlaps with one of its descendants, we
check which predicted probability is greater: that the parent has its
locally-predicted argument label and all its descendants are null, or
that the descendants have their optimal labeling and the parent is
null. This algorithm returns the non-overlapping assignment with
globally highest confidence. Overlaps are uncommon, however; they
occurred only 68 times on the 1346 sentences in the development set.
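The recursive comparison described above can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions: the Node class, the NULL label string, and the toy
probabilities are hypothetical, every node is assumed to carry a
nonzero Null probability and at least one non-null label, and the
local probabilities are treated as independent.

```python
import math
from dataclasses import dataclass, field

NULL = "NULL"  # hypothetical name for the null label


@dataclass
class Node:
    name: str
    probs: dict                       # label -> local probability (must include NULL)
    children: list = field(default_factory=list)


def all_null_logp(node):
    """Log-probability that this node and all of its descendants are null."""
    return math.log(node.probs[NULL]) + sum(all_null_logp(c) for c in node.children)


def best_labeling(node):
    """Return (log-prob, {node name: label}) for the best non-overlapping labeling."""
    # Option A: the node keeps its best non-null label, all descendants are null.
    label, p = max(((l, p) for l, p in node.probs.items() if l != NULL),
                   key=lambda lp: lp[1])
    score_a = math.log(p) + sum(all_null_logp(c) for c in node.children)
    # Option B: the node is null, each child subtree gets its own best labeling.
    child_results = [best_labeling(c) for c in node.children]
    score_b = math.log(node.probs[NULL]) + sum(s for s, _ in child_results)
    if score_a >= score_b:
        return score_a, {node.name: label}
    assignment = {}
    for _, child_assignment in child_results:
        assignment.update(child_assignment)
    return score_b, assignment


# Toy example: a weakly confident parent overlapping two confident children.
weak_parent = Node("parent", {"A1": 0.6, NULL: 0.4}, [
    Node("c1", {"A0": 0.9, NULL: 0.1}),
    Node("c2", {"A1": 0.7, NULL: 0.3}),
])
# Toy example: a very confident parent overlapping two uncertain children.
strong_parent = Node("parent", {"A1": 0.99, NULL: 0.01}, [
    Node("c1", {"A0": 0.6, NULL: 0.4}),
    Node("c2", {"A1": 0.55, NULL: 0.45}),
])
_, weak_result = best_labeling(weak_parent)
_, strong_result = best_labeling(strong_parent)
```

In the first toy tree the children win (0.4 x 0.9 x 0.7 against
0.6 x 0.1 x 0.3), while in the second the parent keeps its label.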

We train the classifiers on PropBank sections 02-21. If a true
semantic argument fails to match any bracketing in the parse tree,
then it is ignored. Both the identification and classification models
are trained using gold parse trees. All of our features are standard
features for this task that have been used in previous work, and are
listed in Tables 1 and 2. We use the maximum-entropy implementation in
the Mallet toolkit (McCallum, 2002) with a Gaussian prior on
parameters.

3  Reranking  Parse  Trees  Using  SRL 
Information 

Here we give the general framework for the reranking methods that we
present in the next section. We write a joint probability model over
semantic frames $F$ and parse trees $t$ given a sentence $x$ as

$$p(F, t \mid x) = p(F \mid t, x)\, p(t \mid x), \tag{1}$$

where $p(t \mid x)$ is given by a standard probabilistic parsing
model, and $p(F \mid t, x)$ is given by the baseline SRL model
described previously.


Base features [GJ02]
    Head word
    Constituent type
    Position
    Predicate
    Voice
Head POS [SHWA03]
From [PWHMJ04]
    Parent Head POS
    First word / POS
    Last word / POS
    Sibling constituent type / head word / head POS
Conjunctions [XP03]
    Voice & Position
    Predicate & Head word
    Predicate & Constituent type

Table 2: Features used in baseline labeling classifier.


Parse Trees Used                          SRL F1
Gold                                      77.1
1-best                                    63.9
Reranked by gold parse F1                 68.1
Reranked by gold frame F1                 74.2
Simple SRL combination ($\alpha = 0.5$)   56.9
Chosen using trained reranker             63.6

Table 3: Comparison of overall SRL F1 on development set by the type
of parse trees used.

In this paper, we choose $(F^*, t^*)$ to approximately maximize the
probability $p(F, t \mid x)$ using a reranking approach. To do the
reranking, we generate a list of $k$-best parse trees for a sentence,
and for each predicted tree, we predict the best frame using the base
SRL model. This results in a list $\{(F^i, t^i)\}$ of parse tree / SRL
frame pairs, from which the reranker chooses. Thus, our different
reranking methods vary only in which parse tree is selected; given a
parse tree, the frame is always chosen using the best prediction from
the base model.

The $k$-best list of parses is generated using Dan Bikel's (2004)
implementation of Michael Collins' parsing model. The parser is
trained on sections 2-21 of the WSJ Treebank, which does not overlap
with the development or test sets. The $k$-best list is generated in
Bikel's implementation by essentially turning off dynamic programming
and doing very aggressive beam search. We gather a maximum of 500 best
parses, but the limit is not usually reached using feasible beam
widths. The mean number of parses per sentence is 176.

4  Results  and  Discussion 

In this section we present results on several reranking methods for
joint parsing and semantic role labeling. Table 3 compares F1 on the
development set of our different reranking methods. The first four
rows in Table 3 are baseline systems. We present baselines using gold
trees (row 1 in Table 3) and predicted trees (row 2). As shown in
previous work, gold trees perform much better than predicted trees.

We also report two cheating baselines to explore the maximum possible
performance of a reranking system. First, we report SRL performance of
ceiling parse trees (row 3), i.e., if the parse tree from the $k$-best
list is chosen to be closest to the gold tree. This is the best
expected performance of a parse reranking approach that maximizes
parse F1. Second, we report SRL performance where the parse tree is
selected to maximize SRL F1, computed using the gold frame (row 4).
There is a significant gap between parse-F1-reranked trees and
SRL-F1-reranked trees, which shows promise for joint reranking.
However, the gap between SRL-F1-reranked trees and gold parse trees
indicates that reranking of parse lists cannot by itself completely
close the gap in SRL performance between gold and predicted parse
trees.
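As an illustration of how such a cheating baseline can be computed,
the following minimal sketch picks the candidate with the highest F1
against a gold annotation. The representation of a frame as a set of
(label, start, end) tuples and the function names are assumptions made
for the example, not the evaluation code used for Table 3.

```python
def f1(predicted, gold):
    """F1 between two sets of items, e.g. labeled argument spans."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def oracle_index(candidates, gold):
    """Index of the candidate whose F1 against the gold set is highest."""
    return max(range(len(candidates)), key=lambda i: f1(candidates[i], gold))


# Hypothetical gold frame and the frames predicted from three candidate trees.
gold_frame = {("A0", 0, 1), ("V", 2, 2), ("A1", 3, 5)}
candidate_frames = [
    {("A0", 0, 1), ("V", 2, 2), ("A1", 3, 4)},   # wrong A1 boundary
    {("A0", 0, 1), ("V", 2, 2), ("A1", 3, 5)},   # matches gold
    {("V", 2, 2)},                               # misses both arguments
]
best = oracle_index(candidate_frames, gold_frame)
```

The same selection applies to the parse-F1 ceiling if the sets hold
labeled constituents of the candidate and gold trees instead of
argument spans.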

4.1  Reranking  based  on  score  combination 

Equation 1 suggests a straightforward method for reranking: simply
pick the parse tree from the $k$-best list that maximizes
$p(F, t \mid x)$, in other words, add the log probabilities from the
parser and the base SRL system. More generally, we consider weighting
the individual probabilities as

$$s(F, t) = p(F \mid t, x)^{1-\alpha}\, p(t \mid x)^{\alpha}. \tag{2}$$

Such a weighted combination is often used in the speech community to
combine acoustic and language models.

This reranking method performs poorly, however. No choice of $\alpha$
performs better than $\alpha = 1$, i.e., choosing the 1-best predicted
parse tree. Indeed, the more weight given to the SRL score, the worse
the combined system performs. The problem is that often a bad parse
tree has many nodes which are obviously not constituents: thus
$p(F \mid t, x)$ for such a bad tree is very high, and therefore not
reliable. As more weight is given to the SRL score, the unlabeled
recall drops, from 71% when $\alpha = 1$ to 55% when $\alpha = 0$.
Most of the decrease in F1 is due to the drop in unlabeled recall.
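In log space, the weighted score of Equation 2 is a linear
interpolation of the two log probabilities. The sketch below shows the
selection rule on made-up numbers; the candidate tuples and function
names are hypothetical and the probabilities are not taken from our
experiments.

```python
import math


def combined_log_score(log_p_frame, log_p_tree, alpha):
    """log s(F, t) = (1 - alpha) * log p(F|t,x) + alpha * log p(t|x)."""
    return (1.0 - alpha) * log_p_frame + alpha * log_p_tree


def rerank(candidates, alpha):
    """candidates: list of (tree_id, log p(t|x), log p(F|t,x)); returns the best tree_id."""
    return max(candidates,
               key=lambda c: combined_log_score(c[2], c[1], alpha))[0]


# Toy k-best list: (tree id, parser log-probability, SRL log-probability).
candidates = [
    ("t1", math.log(0.50), math.log(0.20)),
    ("t2", math.log(0.30), math.log(0.60)),
    ("t3", math.log(0.05), math.log(0.90)),
]
parser_only = rerank(candidates, alpha=1.0)   # ignores the SRL score
srl_only = rerank(candidates, alpha=0.0)      # ignores the parser score
balanced = rerank(candidates, alpha=0.5)      # equal weight on both
```

With these toy numbers, the three settings select three different
trees, which illustrates how strongly the choice of $\alpha$ can move
the selection toward low-probability parses with confident frames.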

4.2  Training  a  reranker  using  global  features 

One potential solution to this problem is to add features of the
entire frame, for example, to vote against predicted frames that are
missing key arguments. But such features depend globally on the entire
frame, and cannot be represented by local classifiers. One way to
train these global features is to learn a linear classifier that
selects a parse / frame pair from the ranked list, in the manner of
Collins (2000). Reranking has previously been applied to semantic role
labeling by Toutanova et al. (2005), from which we use several
features. The difference between this paper and Toutanova et al. is
that instead of reranking $k$-best SRL frames of a single parse tree,
we are reranking 1-best SRL frames from the $k$-best parse trees.

Because of the computational expense of training on $k$-best parse
tree lists for each of 30,000 sentences, we train the reranker only on
sections 15-18 of the Treebank (the same subset used in previous CoNLL
competitions). We train the reranker using LogLoss, rather than the
boosting loss used by Collins. We also restrict the reranker to
consider only the top 25 parse trees.
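For readers unfamiliar with this objective, the following sketch shows
one common form of a log loss for reranking: the negative log of the
softmax probability assigned to the best candidate in the list under a
linear scoring function. The feature names, weights, and the
best_index argument are hypothetical, and this is an assumption about
the general shape of the loss rather than a description of our
training code.

```python
import math


def linear_score(weights, features):
    """Dot product between a sparse weight dict and a sparse feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rerank_log_loss(weights, candidate_features, best_index):
    """Negative log softmax probability of the candidate at best_index."""
    scores = [linear_score(weights, f) for f in candidate_features]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[best_index]


# Two hypothetical candidates that receive the same linear score (-1.5).
weights = {"parse_logp": 1.0, "has=A0": 0.5}
candidates = [
    {"parse_logp": -2.0, "has=A0": 1.0},
    {"parse_logp": -1.5},
]
loss = rerank_log_loss(weights, candidates, best_index=0)
```

When all candidates score equally, as here, the loss is the log of the
list length; training lowers it by raising the score of the best
candidate relative to the others.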

This globally-trained reranker uses all of the features from the local
model, and the following global features: (a) sequence features, i.e.,
the linear sequence of argument labels in the sentence (e.g.
A0_V_A1), (b) the log probability of the parse tree, (c) has-arg
features, that is, for each argument type a binary feature indicating
whether it appears in the frame, (d) the conjunction of the predicate
and has-arg feature, and (e) the number of nodes in the tree
classified as each argument type.
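A minimal sketch of how global features (a) through (e) might be
extracted for one parse / frame pair is given below. The feature-name
strings, the input representation, and the function name are
illustrative assumptions; features for absent argument types are
simply left out of the sparse dictionary.

```python
from collections import Counter


def global_features(predicate, labeled_nodes, parse_log_prob):
    """labeled_nodes: list of (start token index, label) for non-null nodes, including V."""
    features = {}
    ordered = [label for _, label in sorted(labeled_nodes)]
    features["seq=" + "_".join(ordered)] = 1.0                    # (a) label sequence
    features["parse_logp"] = parse_log_prob                       # (b) parser log-probability
    for label, count in Counter(ordered).items():
        features["has=" + label] = 1.0                            # (c) has-arg
        features["pred=" + predicate + "&has=" + label] = 1.0     # (d) predicate & has-arg
        features["count=" + label] = float(count)                 # (e) nodes per argument type
    return features


# Hypothetical frame for the predicate "give" with three labeled nodes.
example = global_features("give", [(2, "V"), (0, "A0"), (3, "A1")], -42.0)
```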

The results of this system on the development set are given in Table 3
(row 6). Although this performs better than the score combination
method, it is still no better than simply taking the 1-best parse
tree. This may be due to the limited training set we used in the
reranking model. A base SRL model trained only on sections 15-18 has
61.26 F1, so in comparison, reranking provides a modest improvement.
This system is the one that we submitted as our official submission.
The results on the test sets are given in Table 4.

                  Precision   Recall    $F_{\beta=1}$
Development       64.43%      63.11%    63.76
Test WSJ          68.57%      64.99%    66.73
Test Brown        62.91%      54.85%    58.60
Test WSJ+Brown    67.86%      63.63%    65.68

Test WSJ          Precision   Recall    $F_{\beta=1}$
Overall           68.57%      64.99%    66.73
A0                69.47%      74.35%    71.83
A1                66.90%      64.91%    65.89
A2                64.42%      61.17%    62.75
A3                62.14%      50.29%    55.59
A4                72.73%      70.59%    71.64
A5                50.00%      20.00%    28.57
AM-ADV            55.90%      49.60%    52.57
AM-CAU            76.60%      49.32%    60.00
AM-DIR            57.89%      38.82%    46.48
AM-DIS            79.73%      73.75%    76.62
AM-EXT            66.67%      43.75%    52.83
AM-LOC            50.26%      53.17%    51.67
AM-MNR            54.32%      51.16%    52.69
AM-MOD            98.50%      95.46%    96.96
AM-NEG            98.20%      94.78%    96.46
AM-PNC            46.08%      40.87%    43.32
AM-PRD             0.00%       0.00%     0.00
AM-REC             0.00%       0.00%     0.00
AM-TMP            72.15%      67.43%    69.71
R-A0               0.00%       0.00%     0.00
R-A1               0.00%       0.00%     0.00
R-A2               0.00%       0.00%     0.00
R-A3               0.00%       0.00%     0.00
R-A4               0.00%       0.00%     0.00
R-AM-ADV           0.00%       0.00%     0.00
R-AM-CAU           0.00%       0.00%     0.00
R-AM-EXT           0.00%       0.00%     0.00
R-AM-LOC           0.00%       0.00%     0.00
R-AM-MNR           0.00%       0.00%     0.00
R-AM-TMP           0.00%       0.00%     0.00
V                 99.21%      86.24%    92.27

Table 4: Overall results (top) and detailed results on the WSJ test
(bottom).

5 Summing over parse trees

In this section, we sketch a different approach to joint SRL and
parsing that does not use reranking at all. Maximizing over parse
trees can mean that poor parse trees are selected if their semantic
labeling has an erroneously high score. But we are not actually
interested in selecting a good parse tree; all we want is a good
semantic frame. This means that we should select the semantic frame
that maximizes the posterior probability
$p(F \mid x) = \sum_t p(F \mid t, x)\, p(t \mid x)$. That is, we
should be summing over the parse trees instead of maximizing over
them. The practical advantage of this approach is that even if one
seemingly-good parse tree does not have a constituent for a semantic
argument, many other parse trees in the $k$-best list might, and all
are considered when computing $F^*$. Also, no single parse tree need
have constituents for all of $F^*$; because the model sums over all
parse trees, it can mix and match constituents between different
trees. The optimal frame $F^*$ can be computed by an $O(N^3)$ parsing
algorithm if appropriate independence assumptions are made on
$p(F \mid x)$. This requires designing an SRL model that is
independent of the bracketing derived from any particular parse tree.
Initial experiments performed poorly because the marginal model
$p(F \mid x)$ was inadequate. Detailed exploration is left for future
work.
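The difference between maximizing and summing over parse trees can be
seen on a toy example. The sketch below enumerates whole frames by
brute force over a three-tree list with made-up probabilities; it does
not implement the $O(N^3)$ algorithm mentioned above, and the frame
names, the numbers, and the assumption that the tree probabilities are
already normalized over the list are all hypothetical.

```python
from collections import defaultdict


def max_max_frame(candidates):
    """Frame of the single best (tree, frame) pair under p(F|t,x) * p(t|x)."""
    pairs = ((p_tree * p_frame, frame)
             for p_tree, frames in candidates
             for frame, p_frame in frames.items())
    return max(pairs, key=lambda pair: pair[0])[1]


def max_sum_frame(candidates):
    """Frame maximizing p(F|x) = sum over trees of p(F|t,x) * p(t|x)."""
    posterior = defaultdict(float)
    for p_tree, frames in candidates:
        for frame, p_frame in frames.items():
            posterior[frame] += p_tree * p_frame
    return max(posterior, key=posterior.get), dict(posterior)


# Each entry: (p(t|x), {frame: p(F|t,x)}) for one tree in the k-best list.
candidates = [
    (0.40, {"F_a": 0.9, "F_b": 0.1}),
    (0.35, {"F_a": 0.2, "F_b": 0.8}),
    (0.25, {"F_a": 0.1, "F_b": 0.9}),
]
picked_by_max = max_max_frame(candidates)
picked_by_sum, posterior = max_sum_frame(candidates)
```

Here the single best pair favors F_a (0.36), but F_b has the larger
summed posterior (0.545 against 0.455) because two of the three trees
support it.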


6  Conclusion  and  Related  Work 

In this paper, we have considered several methods for reranking parse
trees using information from semantic role labeling. So far, we have
not been able to show improvement over selecting the 1-best parse
tree. Gildea and Jurafsky (2002) also report results on reranking
parses using an SRL system, with negative results. In this paper, we
confirm these results with a MaxEnt-trained SRL model, and we extend
them to show that weighting the probabilities does not help either.

Our results with Collins-style reranking are too preliminary to draw
definite conclusions, but the potential improvement does not appear to
be great. In future work, we will explore the max-sum approach, which
has promise to avoid the pitfalls of max-max reranking approaches.